# compute 95% Wald confidence intervals for the model coefficients
coefs <- coef(m)
ses <- sqrt(diag(vcov(m)))
crit <- qt(0.975, df = df.residual(m))
upr <- coefs + (crit * ses)
lwr <- coefs - (crit * ses)

Statistical thinking — simple linear models
2024-10-04
In this session we’ll start to think about modelling the relationships between two or more variables
By modelling the relationship we are suggesting a means by which the data we collected might have originated
We won’t be able to speak to causation in many cases because we have data collected as part of observational studies
But we can test hypotheses about the relationships between variables
We need to be careful about terminology
All the models we consider on this course are linear models
Several different phrases are used to name these models, but they all describe the same model
The model is one where the observations of the response are individually distributed Gaussian (are Gaussian random variables) with
\[\begin{align*} y_i & \sim \text{Gaussian}(\mu_i, \sigma) \\ g(\mu)_i & = \eta_i \\ \eta_i & = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_j x_{ji} \end{align*}\]
Linear regression can be thought of as an extension of the ideas behind correlations
The Gaussian linear model is a far more powerful statistical tool, however
Correlations tell us about the strength of the linear relationship between two variables \(x\) and \(y\)
A Gaussian linear model provides an equation for the line of best fit placed through the data
Further, in the linear regression the roles played by \(x\) and \(y\) are different; we say the values of \(y\) depend, to some degree, on the values of \(x\) measured on the same entities
Hence \(x\) plays the role of the predictor variable and \(y\) is the response
Simple linear regression is a statistical model that assumes a linear relationship between a continuous response variable \(y\) and a single continuous predictor variable
Such models have three major purposes: describing the relationship, testing hypotheses about it, and predicting new observations
In this section we’ll consider the simplest case of a single predictor \(x\) & its relationship with \(y\)
A suitable model for this linear relationship is
\[\begin{align*} y_i & \sim \text{Gaussian}(\mu_i, \sigma) \\ g(\mu)_i & = \eta_i \\ \eta_i & = \beta_0 + \beta_1 x_{1i} \end{align*}\]
The \(\beta_j\) are the model parameters
\(\beta_0\) is the intercept, the expected value of \(y\) when \(x\) is 0
\(\beta_1\) is often called the slope; it measures the rate of change in \(y\) for a unit change in \(x\)
We estimate the two unknown parameters in the model using a procedure known as least squares, where we minimise the Residual Sum of Squares \(\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2\)
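As a minimal sketch of what least squares does, the closed-form estimates that minimise the RSS can be computed directly and compared with `lm()`; the data here are simulated for illustration and are not the data from the notes.

```r
## Sketch: least squares from first principles (simulated data; x, y, m
## are illustrative names, not objects from the notes)
set.seed(42)
x <- runif(20, 0, 10)
y <- 0.7 + 0.8 * x + rnorm(20, sd = 1)

## closed-form solutions that minimise RSS = sum((y - yhat)^2)
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 <- mean(y) - beta1 * mean(x)

## lm() performs the same minimisation
m <- lm(y ~ x)
all.equal(unname(coef(m)), c(beta0, beta1))  # TRUE
```
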
Estimates of parameters (\(\beta_j\)) are for the population based on the fit to our sample of data
Data were 20 observations generated from the following model
\[\mu_i = 0.7 + 0.8x_i \;\;\;\; y_i \sim N(\mu_i, \sigma = 1)\]
Fitted model estimates are: \(\hat{\beta}_0\) = 0.32 and \(\hat{\beta}_1\) = 0.999
The parameters are means & the uncertainty in the estimated values is captured by their standard errors
Confidence intervals for the estimates:
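These are Wald-type intervals, estimate ± \(t_{0.975,\,df}\) × SE. A sketch using R's built-in `cars` data (the fitted model here is purely illustrative, not the model from the notes); `confint()` gives the same answer.

```r
## Sketch: 95% confidence intervals by hand vs confint()
## (cars is a built-in R dataset, used only for illustration)
m <- lm(dist ~ speed, data = cars)
crit <- qt(0.975, df = df.residual(m))
ses  <- sqrt(diag(vcov(m)))
by_hand <- cbind(lwr = coef(m) - crit * ses,
                 upr = coef(m) + crit * ses)
all.equal(unname(by_hand), unname(confint(m)))  # TRUE
```
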
Data are from a study of survival in horses with colic (abdominal pain) admitted to The Royal Veterinary and Agricultural University, Copenhagen
pulse — heart rate measured in beats per minute
pcv — packed cell volume or hematocrit, the volume % of red blood cells in blood
PCV can become elevated due to, inter alia, dehydration or pain
We will explore the relationship between PCV and heart rate
Call:
lm(formula = pulse ~ pcv, data = colic)
Residuals:
Min 1Q Median 3Q Max
-36.276 -11.783 -3.696 10.275 56.101
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.78068 3.77830 2.589 0.00998 **
pcv 1.27542 0.09356 13.631 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.95 on 405 degrees of freedom
(65 observations deleted due to missingness)
Multiple R-squared: 0.3145, Adjusted R-squared: 0.3128
F-statistic: 185.8 on 1 and 405 DF, p-value: < 2.2e-16
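As a quick worked example, the printed coefficients give the fitted equation pulse ≈ 9.78 + 1.28 × pcv; the PCV value used below is illustrative (the colic data themselves are not reproduced here).

```r
## Coefficients copied from the summary output above
b0 <- 9.78068
b1 <- 1.27542
b0 + b1 * 40  # expected heart rate for a horse with PCV of 40% (~60.8 bpm)
```
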
colic_null <- withr::with_seed(2,
  colic |>
    specify(pulse ~ pcv) |>
    hypothesize(null = "independence") |>
    infer::generate(reps = 1000, type = "permute") |>
    calculate(stat = "slope")
)
obs_slope <- colic |>
specify(pulse ~ pcv) |>
calculate(stat = "slope")
colic_null |>
visualize() +
shade_p_value(obs_stat = obs_slope,
direction = "two-sided")

Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.78068 3.77830 2.589 0.00998 **
pcv 1.27542 0.09356 13.631 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Estimate column contains the estimated parameters
The \(t\) statistics have \(n - 2\) = 405 degrees of freedom
If a \(t\) statistic lies in either rejection region, reject H0 at the \(\alpha\) = 0.05 level
Linear regression can be seen as a partitioning of variance in the data in amounts explained by variables in the model & unexplained variance
Mean squares can only be positive; even a variable unrelated to response will explain some variance
Compare the ratio of mean squares (the SSq normalised by degrees of freedom) with an F distribution with 1 & \(n - 2\) degrees of freedom (here, 1 & 405)
Analysis of Variance Table
Response: pulse
Df Sum Sq Mean Sq F value Pr(>F)
pcv 1 53380 53380 186 <2e-16 ***
Residuals 405 116344 287
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\(F\) is the \(F\)-ratio, the ratio of the regression and residual variances
\[F = \frac{\sum\limits^n_{i=1}(\hat{y}_i - \bar{y})^2 / p}{\sum\limits^n_{i=1}(y_i - \hat{y}_i)^2 / [n-(p+1)]} = \frac{\mathrm{MS_{reg}}}{\mathrm{MS_{resid}}}\]
Probability of \(F\) greater than or equal to observed from \(F\)-distribution with \(p\) and \(n - (p + 1)\) degrees of freedom
Large values of F are evidence against the null hypothesis of no relationship
Reference distribution is \(\mathsf{F_{1,405}}\); the rejection region is entirely in the upper tail
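The \(F\)-ratio above can be reconstructed from the fitted values and residuals; a sketch with simulated data (not the colic data), checked against the value `summary()` reports.

```r
## Sketch: F-ratio from the sums of squares (simulated data, illustrative)
set.seed(1)
x <- runif(40, 0, 10)
y <- 2 + 0.5 * x + rnorm(40)
m <- lm(y ~ x)
n <- length(y); p <- 1                       # one predictor
MS_reg   <- sum((fitted(m) - mean(y))^2) / p
MS_resid <- sum(residuals(m)^2) / (n - (p + 1))
F_ratio  <- MS_reg / MS_resid
all.equal(F_ratio, unname(summary(m)$fstatistic["value"]))  # TRUE
```
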
\(R^2\) is a commonly reported measure of variance explained by the regression
\(R^2\) is the coefficient of determination, the ratio of the variance explained to the total variance
\[R^2 = \frac{\mathrm{SS_{reg}}}{\mathrm{SS_{reg} + RSS}} = 1 - \frac{\mathrm{SS_{resid}}}{\mathrm{SS_{total}}}\]
One problem with \(R^2\) is that it increases as you add predictors to the model, even if those predictors have no explanatory power
Adjusted \(R^2\) takes into account number of predictors in the model
\[R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{SS_{resid}} / [n - (p + 1)]}{\mathrm{SS_{total}} / (n-1)}\]
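Both quantities follow directly from the sums of squares; a sketch on simulated data (illustrative names, not the notes' data), checked against `summary()`.

```r
## Sketch: R^2 and adjusted R^2 from the sums of squares
set.seed(1)
x <- runif(40)
y <- 1 + 2 * x + rnorm(40)
m <- lm(y ~ x)
n <- length(y); p <- 1
SS_resid <- sum(residuals(m)^2)
SS_total <- sum((y - mean(y))^2)
r2     <- 1 - SS_resid / SS_total
r2_adj <- 1 - (SS_resid / (n - (p + 1))) / (SS_total / (n - 1))
all.equal(c(r2, r2_adj),
          c(summary(m)$r.squared, summary(m)$adj.r.squared))  # TRUE
```
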
Neither measure is a particularly useful summary; they don't indicate how well the regression model will predict new observations
The effect size, the size of the model coefficients, is far more useful
To fit the model, we don’t need to make any assumptions, beyond linearity
Statistical inference on the estimated parameters depends on a number of assumptions
The linear model correctly describes the functional relationship between \(y\) and \(x\)
Effectively, this assumption is about whether the model we’ve fitted represents the true underlying relationship between \(x\) and \(y\)
Is the relationship well approximated via a straight line?
Would a curved line (polynomial) be better?
If we have the model form wrong, our model will be biased; the fitted values will differ appreciably from the true values
If violated, the estimate of the residual variance (\(\sigma^2\)) will be inflated
Incorrect model specification can show itself as patterns in the residuals
\(x_i\) are measured without error
This assumption states that we know the values of the predictor variable(s) \(x\) exactly
Allows us to isolate the error component as random variation in \(y\)
Estimates \(\hat{\beta}_j\) will be biased if there is error in \(x\)
This is often ignored in data analysis, but modern, advanced approaches can account for this extra source of variation
For any \(x_i\), the sampled \(y_i\) are independent Gaussian random variables
Variances are constant along the regression line/model
These 2 assumptions relate to distribution of the residuals (\(\varepsilon_i\)), or, conditional upon the values of \(x\), of \(y\)
\(\varepsilon_i\) are assumed to follow a Normal distribution with zero mean and constant variance
Independence and normality of errors allows us to use parametric theory for confidence intervals and hypothesis tests on the \(F\)-ratio.
Allows a single constant variance \(\sigma^2\) for the variance of the regression line/model
Each residual is drawn from the same distribution; Normal with mean zero, variance \(\hat{\sigma}^2\)
Non-constant variance can be recognised through plots of residuals (among others); e.g. residuals that get wider as the values of \(y\) increase
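The standard diagnostic is a residuals-vs-fitted plot; a sketch using R's built-in `cars` data (illustrative only).

```r
## Sketch: residuals-vs-fitted plot, the usual check for non-constant
## variance (cars is a built-in dataset, used only for illustration)
m <- lm(dist ~ speed, data = cars)
plot(fitted(m), residuals(m),
     xlab = "Fitted values", ylab = "Residuals")  # look for fanning
abline(h = 0, lty = 2)  # residuals should scatter evenly about zero
```
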
Independence is a key assumption; knowing the value of one observation tells us nothing about another beyond their relationship through \(x\)
Data that have spatial or temporal components, or represent repeated observations on the same set of individuals are commonly encountered & violate this assumption
An outlier is an observation which is inconsistent with the rest of a sample
An observation can be an outlier due to the response variable(s) or one or more of the predictor variables having values outside their expected limits
An outlier may also result from: incorrect measurement, incorrect data entry, transcription error, recording error
Two main concepts: leverage and influence
Leverage measures the degree to which individual observations affect the fitted value for that observation
Leverage values are also known as hat values, as they are the values on the diagonal of the hat matrix, which projects the observed values onto the fitted values of the model
Hat matrix is so called because it puts a hat on \(\mathbf{Y}\): \(\mathbf{\hat{Y} = HY}\)
Leverage ranges from 1/n to 1
Observation has high leverage if its hat value is 2–3 times the average hat value: \(h = (p+1)/n\), where \(p+1\) is number of coefficients (inc. the intercept)
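In R the hat values come from `hatvalues()`; a sketch with the built-in `cars` data (illustrative), including a check of the fact that the hat values always average to \((p+1)/n\).

```r
## Sketch: leverage (hat values) and the 2x-average rule of thumb
m <- lm(dist ~ speed, data = cars)   # built-in data, illustrative
h <- hatvalues(m)
n <- nrow(cars); p <- 1
h_bar <- (p + 1) / n                 # average hat value
high_lev <- which(h > 2 * h_bar)     # candidate high-leverage observations
all.equal(mean(h), h_bar)            # hat values average to (p+1)/n: TRUE
```
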
An observation that combines outlyingness with high leverage exerts an influence on the estimated regression coefficients
If an influential observation is deleted from the model, the estimated coefficients change substantially
\(\mathsf{dfbeta}_{ij}\) assesses the impact on the \(j\)th coefficient of deleting the \(i\)th observation
\[\mathsf{dfbeta}_{ij} = \beta_{j(-i)} - \beta_j\]
The \(\mathsf{dfbeta}_{ij}\) are expressed in the metric of the coefficient
A standardised version, \(\mathsf{dfbetas}_{ij}\) divides \(\mathsf{dfbeta}_{ij}\) by the standard error of \(\beta_j\)
Influential observations have \(\mathsf{dfbetas}_{ij} \geq 2 / \sqrt{n}\)
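R computes these via its influence measures; a sketch with the built-in `cars` data (illustrative), flagging observations against the \(2/\sqrt{n}\) threshold.

```r
## Sketch: dfbetas and the 2/sqrt(n) threshold (cars data, illustrative);
## dfbetas(m) returns an n x (p+1) matrix, one column per coefficient
m <- lm(dist ~ speed, data = cars)
db <- dfbetas(m)
n <- nrow(cars)
flagged <- which(abs(db) >= 2 / sqrt(n), arr.ind = TRUE)
flagged  # row = observation, col = coefficient
```
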
One problem with \(\mathsf{dfbetas}_{ij}\) is that there are so many numbers!
One for each observation for every \(\beta_j\): \(n \times (p+1)\)
Cook's distance, \(D_i\), is a scale-invariant measure of the distance between \(\beta_j\) and \(\beta_{j(-i)}\)
\[D_i = \frac{e^2_i}{s^2(p+1)} \times \frac{h_i}{(1-h_i)^2}\]
where \(e_i^2\) is the squared residual & \(h_i\) is the hat value for \(x_i\); \(s^2\) is the variance of the residuals
The first fraction is a measure of outlyingness, the second of leverage
\(D_i \geq 4 / (n - p - 1)\) suggested as a cutoff for high values of \(D_i\)
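A sketch checking `cooks.distance()` against a by-hand computation (note the squared \(1 - h_i\) term when working with raw residuals), and applying the suggested cutoff; `cars` data, illustrative.

```r
## Sketch: Cook's distance by hand vs cooks.distance()
m <- lm(dist ~ speed, data = cars)   # built-in data, illustrative
n <- nrow(cars); p <- 1
h  <- hatvalues(m)
s2 <- summary(m)$sigma^2             # residual variance
## outlyingness x leverage, using raw residuals
D_hand <- residuals(m)^2 / (s2 * (p + 1)) * h / (1 - h)^2
D <- cooks.distance(m)
all.equal(unname(D), unname(D_hand))        # TRUE
influential <- which(D >= 4 / (n - p - 1))  # suggested cutoff
```
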
The simple regression model readily generalises to the situation where we have \(m\) predictors not just one
\[\begin{align*} y_i & \sim \text{Gaussian}(\mu_i, \sigma) \\ g(\mu)_i & = \eta_i \\ \eta_i & = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_m x_{mi} \end{align*}\]
Now we have \(m + 1\) parameters to estimate, one for intercept and one each for the \(m\) predictors \(x_m\). Can have as many as \(m = n - 1\) predictor variables
The simple description as a regression line starts to be blurred; with \(m = 2\) we have a regression plane
But in other respects the model fitting, assumptions, etc remain the same
We do now have the problem of deciding which of the \(m\) predictors are related to \(y\)
Several additional variables were measured in the horse colic study
A number of predictor variables thought to affect the heart rate (pulse) of animals admitted with colic: the temperature (temp) of each horse, crt, sbe, and pcv
Typical statistical output for the full model
colic2 <- colic |>
filter(complete.cases(colic))
colic_mr <- lm(pulse ~ pcv + sbe + temp + crt, data = colic2)
colic_mr_0 <- lm(pulse ~ 1, data = colic2)
print(summary(colic_mr), digits = 3)
Call:
lm(formula = pulse ~ pcv + sbe + temp + crt, data = colic2)
Residuals:
Min 1Q Median 3Q Max
-41.52 -9.87 -2.54 8.33 59.12
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -156.840 46.520 -3.37 0.00083 ***
pcv 0.679 0.105 6.45 3.6e-10 ***
sbe -0.751 0.170 -4.43 1.3e-05 ***
temp 4.559 1.230 3.71 0.00024 ***
crt 5.225 0.715 7.31 1.8e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.5 on 352 degrees of freedom
Multiple R-squared: 0.479, Adjusted R-squared: 0.473
F-statistic: 81 on 4 and 352 DF, p-value: <2e-16
The t tests are marginal: each is only a reflection of what would happen if you removed that variable from the model, leaving the others in the model
However, the joint F test is significant, indicating that there is an effect in there somewhere
Sequential sums of squares (Type I), ordering matters & changes results
Analysis of Variance Table
Response: pulse
Df Sum Sq Mean Sq F value Pr(>F)
pcv 1 43826 43826 207.37 < 2e-16 ***
sbe 1 10227 10227 48.39 1.71e-11 ***
temp 1 3125 3125 14.78 0.000143 ***
crt 1 11291 11291 53.43 1.82e-12 ***
Residuals 352 74393 211
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Marginal sums of squares (Type II); each term is tested after all the others, so ordering doesn't matter
Anova Table (Type II tests)
Response: pulse
Sum Sq Df F value Pr(>F)
pcv 8800 1 41.64 3.63e-10 ***
sbe 4145 1 19.61 1.27e-05 ***
temp 2903 1 13.74 0.000244 ***
crt 11291 1 53.43 1.82e-12 ***
Residuals 74393 352
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance (ANOVA) is the classic name for a linear model where the predictor (explanatory) variables are categorical
Earlier ANOVA used to partition variance in \(y\) into components explained by \(x_j\) & a residual component not explained by the regression model
A slightly more restricted view of ANOVA is that it is a technique for partitioning the variation in \(y\) into that explained by one or more categorical predictor variables
The categories of each factor are the groups or experimental treatments
ANOVA considers the different sources of variation that might arise in a data set
Of particular interest are the differences in the mean value of \(y\) between groups
We can think of within-group and between-group variances
There will always be variation between individuals, but is this within-group variance large or small relative to the variance between groups?
One of the complications surrounding ANOVA is the convoluted nomenclature used to describe variants of the method
Variants commonly distinguished by the number of categorical variables in the model
Two-way and higher ANOVA potentially involve the consideration of factor—factor interactions
In a one-way ANOVA we have a single categorical variable \(x\) with two or more levels
With two levels we have the same analysis as the t test
If we consider differences between animals of different breed, we might use a breed factor whose levels are the breeds in the study
If we’re testing the effect of parity, the factor might be parity with levels: 1, 2, 3, & 4+
Assume we have a single categorical variable \(x\) with three levels. The One-way ANOVA model using dummy coding or treatment contrasts is
\[y_i = \beta_0 + \beta_1D_{i1} + \beta_2D_{i2} + \varepsilon_i\]
where \(D_{ij}\) is the coding for the \(j\)th level (group) for the \(i\)th observation
| Group | \(D_1\) | \(D_2\) |
|---|---|---|
| 1 | 0 | 0 |
| 2 | 1 | 0 |
| 3 | 0 | 1 |
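In R, treatment contrasts are the default, and `model.matrix()` reproduces the \(D_1\), \(D_2\) columns of the table; a minimal sketch with one observation per group (illustrative).

```r
## Sketch: dummy (treatment-contrast) coding for a three-level factor;
## columns g2 and g3 correspond to D1 and D2 in the table above
g <- factor(c(1, 2, 3))
mm <- model.matrix(~ g)
mm[, c("g2", "g3")]  # row 1: 0 0; row 2: 1 0; row 3: 0 1
```
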
Measure the between-group variance as the regression sums of squares
The within-group variance is the residual sums of squares
An omnibus test F statistic is used to test the null hypothesis of no differences among population group means
\[\mathsf{H_0:} \; \beta_1 = \beta_2 = 0\]
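A sketch of the omnibus test on simulated data with three groups (illustrative names, not the anesthesia data): the one-way ANOVA is just a linear model on the dummy-coded factor.

```r
## Sketch: one-way ANOVA as a linear model, omnibus F test of
## H0: beta1 = beta2 = 0 (all group means equal); simulated data
set.seed(1)
g <- factor(rep(c("A", "B", "C"), each = 10))
y <- rnorm(30, mean = c(5, 6, 8)[g], sd = 1)
m1 <- lm(y ~ g)
anova(m1)  # F compares between-group to within-group mean squares
```
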
Low & colleagues (2016, PLOS One) conducted a study to examine the effects of two different anesthetics on aspects of mouse physiology.
Blood CO2 levels were recorded after 120 minutes as the response variable
mice <- read_csv("data/mice-anesthesia/mice-anesthesia.csv") |>
mutate(anesth = fct_recode(anesth, "Isoflurane" = "iso", "Alpha chloralose" = "ac")) |>
rename(anesthetic = anesth)
mice_labs <- labs(x = "Anesthetic", y = expression(CO[2]))
mice |>
ggplot(aes(y = co2, x = anesthetic)) +
geom_boxplot() +
geom_point(position = position_jitter(width = 0.1), aes(colour = anesthetic)) +
guides(colour = "none") +
  mice_labs

Results of fitting one-way ANOVA to the anesthesia data
Multiple comparisons
Term                          Estimate Std. Error     z Pr(>|z|)   S 2.5 % 97.5 %
Isoflurane - Alpha chloralose    -20.9       6.76 -3.09  0.00198 9.0 -34.2  -7.66

Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
Type: response